speech-to-speech translation
- Europe > Austria > Vienna (0.14)
- Asia > South Korea > Incheon > Incheon (0.04)
- North America > Canada > British Columbia > Vancouver (0.04)
- (12 more...)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- (3 more...)
RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data
Zheng, Zhisheng, Sun, Xiaohang, Dinh, Tuan, Yanamandra, Abhishek, Jain, Abhinav, Liu, Zhu, Hadap, Sunil, Bhat, Vimal, Aggarwal, Manoj, Medioni, Gerard, Harwath, David
The scarcity of parallel speech corpora critically hampers speech-to-speech translation (S2ST), often forcing reliance on complex, multi-stage pipelines. This paper introduces RosettaSpeech, a novel and simplified framework for zero-shot S2ST that is trained on monolingual speech-text data augmented by machine translation supervision. While our method leverages the linguistic knowledge inherent in text-based NMT models, it strictly eliminates the need for parallel speech-to-speech pairs. Our model uniquely uses text as an intermediate bridge during training but functions as a direct, end-to-end speech-to-speech model at inference. This streamlined approach achieves state-of-the-art results on standard benchmarks. For instance, on the CVSS-C test set, RosettaSpeech outperforms leading systems, achieving an ASR-BLEU score of 25.17 for German-to-English and 29.86 for Spanish-to-English, relative gains of over 27% and 14%, respectively. Furthermore, we demonstrate that a single model can deliver strong many-to-one translation performance (FR/ES/DE -> EN). We also provide a foundational analysis of how training data scaling impacts model performance. By prioritizing reliance on abundant parallel text rather than difficult-to-acquire parallel speech, RosettaSpeech offers a scalable path to creating high-quality, speaker-preserving S2ST for a much broader array of languages.
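As a rough illustration of the text-bridge idea in this abstract, the sketch below shows how S2ST training supervision could be derived from purely monolingual (speech, transcript) pairs. The `nmt_translate` and `text_to_units` callables are hypothetical placeholders, not the authors' API, and the data layout is an assumption.

```python
# Hypothetical sketch of text-bridged training data construction;
# no parallel speech is assumed, matching the recipe in the abstract.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class S2STExample:
    src_speech: bytes          # source-language audio (any encoding)
    src_text: str              # its monolingual transcript
    tgt_text: str              # text bridge, produced by an NMT model
    tgt_units: List[int]       # discrete target-speech training targets

def build_example(src_speech: bytes, src_text: str,
                  nmt_translate: Callable[[str], str],
                  text_to_units: Callable[[str], List[int]]) -> S2STExample:
    """Derive S2ST supervision from one monolingual (speech, text) pair.

    The NMT model supplies the target text (a bridge used only during
    training); a tokenizer maps it to discrete speech units. At inference
    the trained model maps speech to speech directly, with no text step.
    """
    tgt_text = nmt_translate(src_text)
    return S2STExample(src_speech, src_text, tgt_text, text_to_units(tgt_text))
```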
Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data
Rashidi, Sina, Sameti, Hossein
Direct speech-to-speech translation (S2ST), in which all components are trained jointly, is an attractive alternative to cascaded systems because it offers a simpler pipeline and lower inference latency. However, direct S2ST models require large amounts of parallel speech data in the source and target languages, which are rarely available for low-resource languages such as Persian. This paper presents a direct S2ST system for translating Persian speech into English speech, as well as a pipeline for synthetic parallel Persian-English speech generation. The model comprises three components: (1) a conformer-based encoder, initialized from self-supervised pre-training, maps source speech to high-level acoustic representations; (2) a causal transformer decoder with relative-position multi-head attention translates these representations into discrete target speech units; (3) a unit-based neural vocoder generates waveforms from the predicted discrete units. To mitigate the data scarcity problem, we construct a new Persian-English parallel speech corpus by translating Persian speech transcriptions into English using a large language model and then synthesizing the corresponding English speech with a state-of-the-art zero-shot text-to-speech system. The resulting corpus increases the amount of available parallel speech by roughly a factor of six. On the Persian-English portion of the CVSS corpus, the proposed model achieves an improvement of 4.6 ASR-BLEU over direct baselines when trained with the synthetic data. These results indicate that combining self-supervised pre-training, discrete speech units, and synthetic parallel data is effective for improving direct S2ST in low-resource language pairs such as Persian-English.
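The synthetic-data pipeline described in this abstract is simple enough to sketch. Below is a minimal, hypothetical version in which `llm_translate` and `zero_shot_tts` stand in for the unnamed LLM and zero-shot TTS systems; the corpus representation is an assumption.

```python
from typing import Callable, Iterable, List, Tuple

def synthesize_parallel_corpus(
    persian_corpus: Iterable[Tuple[bytes, str]],   # (speech, transcript) pairs
    llm_translate: Callable[[str], str],           # fa text -> en text
    zero_shot_tts: Callable[[str], bytes],         # en text -> en speech
) -> List[Tuple[bytes, bytes]]:
    """Expand monolingual Persian data into synthetic fa->en parallel speech.

    Each Persian transcript is translated by an LLM, and the English side
    is synthesized with zero-shot TTS, yielding (fa speech, en speech)
    pairs suitable for direct S2ST training.
    """
    parallel = []
    for fa_speech, fa_text in persian_corpus:
        en_text = llm_translate(fa_text)
        parallel.append((fa_speech, zero_shot_tts(en_text)))
    return parallel
```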
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Spain (0.04)
- Europe > Austria > Styria > Graz (0.04)
- Asia (0.04)
- Materials > Metals & Mining (0.46)
- Government (0.46)
StressTransfer: Stress-Aware Speech-to-Speech Translation with Emphasis Preservation
Chen, Xi, Song, Yuchen, Nakamura, Satoshi
EmphST-Bench. To guide algorithm exploration and evaluate the performance of our model, we design an evaluation pipeline for emphasis-preserving speech-to-speech translation. Given the lack of ready-to-use benchmarks for this important task, we leverage LLMs to translate the test set of the StressTest [21] corpus into the target language and then filter the results via human experts. This process creates a high-quality benchmark dataset, EmphST-Bench, with manually verified emphasis alignments between source and target utterances, ensuring reliable assessment of cross-lingual emphasis preservation. The human filtering step focuses on correcting discrepancies in semantic equivalence, contrastive focus, and emotional intensity, resulting in a robust evaluation set that closely mirrors real-world linguistic nuances. EmphST-Bench consists of carefully selected parallel samples from English (source) to Chinese (target), providing a standardized resource for evaluating stress-aware S2ST systems. We report the statistics of EmphST-Bench in Table 1; the benchmark comprises 218 samples.
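A hedged sketch of that construction loop follows; `translate_with_emphasis` and `expert_approves` are placeholder names for the LLM translation step and the human-filtering step, neither of which is specified in this excerpt.

```python
from typing import Callable, List, Tuple

def build_emphst_bench(
    stresstest_samples: List[str],                    # English source utterances
    translate_with_emphasis: Callable[[str], str],    # LLM en -> zh translation
    expert_approves: Callable[[str, str], bool],      # human verification
) -> List[Tuple[str, str]]:
    """Assemble emphasis-aligned (en, zh) pairs for a benchmark like EmphST-Bench.

    Experts reject pairs with discrepancies in semantic equivalence,
    contrastive focus, or emotional intensity, as the excerpt describes.
    """
    return [
        (en, zh)
        for en in stresstest_samples
        for zh in [translate_with_emphasis(en)]
        if expert_approves(en, zh)
    ]
```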
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- Asia > China > Hong Kong (0.04)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)
MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction
Wang, Jianjin, Zhao, Runsong, Liu, Xiaoqian, Ge, Yuan, Xu, Ziqiang, Xiao, Tong, Gao, Shengxiang, Yu, Zhengtao, Zhu, Jingbo
Current direct speech-to-speech translation methods predominantly employ speech tokens as intermediate representations. However, a single speech token is semantically sparse, so multiple tokens are generally needed to express a complete semantic unit. To address this limitation, we introduce a multi-token prediction (MTP) loss into speech-to-unit translation (S2UT) models, enabling them to predict multiple subsequent tokens at each position, thereby capturing more complete semantics and increasing the information density per position. Initial MTP implementations apply the loss at the final layer, which improves the output representation but begins information enrichment too late. We hypothesize that moving this enrichment to intermediate layers yields earlier and more effective enhancement of the hidden representations. Consequently, we propose the MTP-S2UT loss, which applies the MTP loss to the hidden representations where the CTC loss is computed. Experiments demonstrate that all MTP loss variants consistently improve the quality of S2UT translation, with MTP-S2UT achieving the best performance.
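To make the MTP idea concrete, here is a minimal PyTorch sketch of a multi-token prediction loss with one projection head per future offset. Feeding it final-layer states gives plain MTP, while feeding it the intermediate hidden states where the CTC loss is computed corresponds to the MTP-S2UT variant; the per-offset head layout is an assumption, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mtp_loss(hidden: torch.Tensor,       # (B, T, D) hidden states
             heads: nn.ModuleList,       # one nn.Linear(D, vocab) per offset
             units: torch.Tensor) -> torch.Tensor:  # (B, T) target unit ids
    """Average cross-entropy over predictions of the next k tokens.

    Position t is trained to predict units[t + k] through heads[k - 1],
    so each position supervises several subsequent tokens at once.
    """
    total, n_terms = hidden.new_zeros(()), 0
    T = hidden.size(1)
    for k, head in enumerate(heads, start=1):
        if T <= k:
            break
        logits = head(hidden[:, : T - k])            # predict token at t + k
        total = total + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), units[:, k:].reshape(-1)
        )
        n_terms += 1
    return total / max(n_terms, 1)
```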
- Asia > China > Yunnan Province > Kunming (0.04)
- Asia > China > Liaoning Province > Shenyang (0.04)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- (7 more...)
Speech Vecalign: an Embedding-based Method for Aligning Parallel Speech Documents
We present Speech Vecalign, a parallel speech document alignment method that monotonically aligns speech segment embeddings and does not depend on text transcriptions. Compared to the baseline method Global Mining, a variant of speech mining, Speech Vecalign produces longer speech-to-speech alignments. It also demonstrates greater robustness than Local Mining, another speech mining variant, as it produces less noise. We applied Speech Vecalign to 3,000 hours of unlabeled parallel English-German (En-De) speech documents from VoxPopuli, yielding about 1,000 hours of high-quality alignments. We then trained En-De speech-to-speech translation models on the aligned data. Speech Vecalign improves the En-to-De and De-to-En performance over Global Mining by 0.37 and 0.18 ASR-BLEU, respectively. Moreover, our models match or outperform SpeechMatrix model performance, despite using 8 times fewer raw speech documents.
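For intuition, a toy version of monotonic embedding alignment is sketched below: a plain match-or-skip dynamic program over cosine distances. Speech Vecalign itself, like the Vecalign algorithm it builds on, additionally handles many-to-many merges and coarse-to-fine search, so this is only a simplified stand-in.

```python
import numpy as np

def monotonic_alignment_cost(src: np.ndarray, tgt: np.ndarray,
                             skip_cost: float = 0.5) -> float:
    """Cost of the best monotonic 1-1 alignment of two embedding sequences.

    src: (n, d) source speech-segment embeddings
    tgt: (m, d) target speech-segment embeddings
    Matching segment i to j costs their cosine distance; either side may
    be skipped for a fixed penalty. Backtrace is omitted for brevity.
    """
    def cos_dist(a, b):
        return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    n, m = len(src), len(tgt)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:
                D[i, j] = min(D[i, j], D[i-1, j-1] + cos_dist(src[i-1], tgt[j-1]))
            if i:
                D[i, j] = min(D[i, j], D[i-1, j] + skip_cost)  # drop a source segment
            if j:
                D[i, j] = min(D[i, j], D[i, j-1] + skip_cost)  # drop a target segment
    return float(D[n, m])
```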
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > Canada > Ontario > Toronto (0.04)
- (16 more...)
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.47)